Gated Recurrent Unit (GRU)

GRU gets rid of the LSTM's cell state/memory and uses the hidden state alone to transfer information.

Formula

Update Gate (Forget Old + Incorporate New):

$$
z_t = \sigma(\mathbf{W_z} \cdot \begin{bmatrix} w_t & h_{t-1} \end{bmatrix})
$$

Reset Gate:

$$
r_t = \sigma(\mathbf{W_r} \cdot \begin{bmatrix} w_t & h_{t-1} \end{bmatrix})
$$

Output:

$$
\begin{aligned}
\tilde{h}_t &= \tanh(\mathbf{W} \cdot \begin{bmatrix} w_t & r_t \odot h_{t-1} \end{bmatrix}) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \\
    &= (\text{Forget Old Info}) + (\text{Incorporate New Info})
\end{aligned}
$$

Note that the update rule for $h_t$ has the same form as the Q-learning update rule: an interpolation between the old value and a new estimate, with the gate $z_t$ playing the role of the learning rate.
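A minimal NumPy sketch of a single GRU step following the formulas above (the function name `gru_step`, the toy dimensions, and the random weights are illustrative, and bias terms are omitted to match the equations):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(w_t, h_prev, W_z, W_r, W):
    """One GRU step.

    w_t    : input vector at time t, shape (d_in,)
    h_prev : previous hidden state h_{t-1}, shape (d_h,)
    W_z, W_r, W : weight matrices of shape (d_h, d_in + d_h)
    """
    concat = np.concatenate([w_t, h_prev])               # [w_t, h_{t-1}]
    z_t = sigmoid(W_z @ concat)                          # update gate
    r_t = sigmoid(W_r @ concat)                          # reset gate
    concat_reset = np.concatenate([w_t, r_t * h_prev])   # [w_t, r_t ⊙ h_{t-1}]
    h_tilde = np.tanh(W @ concat_reset)                  # candidate hidden state
    h_t = (1 - z_t) * h_prev + z_t * h_tilde             # forget old + incorporate new
    return h_t

# Toy usage with random weights (hypothetical sizes).
d_in, d_h = 4, 3
rng = np.random.default_rng(0)
W_z, W_r, W = (rng.standard_normal((d_h, d_in + d_h)) for _ in range(3))
h = np.zeros(d_h)
for t in range(5):
    h = gru_step(rng.standard_normal(d_in), h, W_z, W_r, W)
print(h.shape)  # (3,)
```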

Gates

The difference between GRU and LSTM is that GRU discards the cell state $c_t$ of LSTM and uses the hidden vector $h_t$ to pass information. The update gate $z_t$ controls how much past memory to keep and how much new information to incorporate, serving the functions of both the forget gate and the input gate in LSTM, while the reset gate $r_t$ specifically controls how much of the past memory to forget when computing the candidate state $\tilde{h}_t$.

GRU has fewer parameters than LSTM and is thus easier to train, while achieving comparable performance to LSTM on many tasks.
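As a rough check of the parameter-count claim, assuming PyTorch is available, the built-in `nn.GRU` and `nn.LSTM` modules can be compared directly (the layer sizes below are arbitrary examples):

```python
import torch.nn as nn

def num_params(module):
    return sum(p.numel() for p in module.parameters())

input_size, hidden_size = 128, 256
gru = nn.GRU(input_size, hidden_size)
lstm = nn.LSTM(input_size, hidden_size)

# GRU uses 3 gate blocks vs. LSTM's 4, so roughly 3/4 the parameters.
print(num_params(gru))   # 296448 for these sizes
print(num_params(lstm))  # 395264 for these sizes
```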
